5 research outputs found

    Localization of Sound Sources in a Room with One Microphone

    Estimation of the location of sound sources is usually done using microphone arrays. Such setups provide an environment in which the differences between the signals received at different microphones, in terms of phase or attenuation, are known, which enables localization of the sound sources. In our solution we exploit the properties of the room transfer function in order to localize a sound source inside a room with only one microphone. The shape of the room and the position of the microphone are assumed to be known. Design guidelines and limitations of the sensing matrix are given. The implementation is based on sparsity in terms of the voxels in the room that are occupied by a source. What is especially interesting about our solution is that it localizes sound sources not only in the horizontal plane, but in terms of 3D coordinates inside the room.
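    A minimal sketch of the sparse-recovery step described above, assuming a sensing matrix A has already been built so that its j-th column holds the (simulated) room-transfer-function measurements for voxel j; the plain-numpy OMP solver and the toy dictionary below are illustrative, not the paper's implementation.

        import numpy as np

        def omp(A, y, k):
            """Orthogonal Matching Pursuit: recover a k-sparse x with y ~= A @ x."""
            residual, support = y.copy(), []
            for _ in range(k):
                # Pick the column most correlated with the current residual.
                support.append(int(np.argmax(np.abs(A.T @ residual))))
                # Re-fit on the selected columns and update the residual.
                coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
                residual = y - A[:, support] @ coeffs
            x = np.zeros(A.shape[1])
            x[support] = coeffs
            return x

        # Toy demo: 200 measurements, 1000 candidate voxels, one active source.
        rng = np.random.default_rng(0)
        A = rng.standard_normal((200, 1000))        # stand-in for the RTF dictionary
        x_true = np.zeros(1000); x_true[437] = 1.0  # source occupies voxel 437
        y = A @ x_true
        x_hat = omp(A, y, k=1)
        print("estimated voxel:", int(np.argmax(np.abs(x_hat))))  # -> 437

    With a single active source, k = 1 and the recovery reduces to matched filtering against the voxel dictionary; multiple sources simply raise the sparsity level.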

    Sparse and Parametric Modeling with Applications to Acoustics and Audio

    Recent advances in signal processing, machine learning and deep learning that exploit the sparse intrinsic structure of data have paved the way for solving inverse problems in acoustics and audio. The main task of this thesis was to bridge the gap between these powerful mathematical tools and challenging problems in acoustics and audio. The thesis consists of two main parts. The first part focuses on questions related to acoustic simulations that comply with real-world constraints and on acoustic data acquisition inside closed spaces. The simulated and measured data are used to solve various types of inverse problems with underlying sparsity. Using compressed sensing, we estimate the room modes, localize sound sources in a room and estimate the room geometry. The Finite Rate of Innovation technique is coupled with non-convex optimization for blind deconvolution in the context of echo retrieval. We also introduce a new statistical measure of echo density for detecting the type of acoustic environment from its acoustic impulse response, even beyond fully closed spaces. Such solutions can find application in the booming domain of virtual, augmented and mixed reality for sound compression and rendering. The second part of the thesis focuses on recent trends in machine learning centered around deep learning. Large-scale acquisition of acoustic impulse responses is still a challenging and very expensive task. Moreover, the existing databases tend to be too heterogeneous to be merged, owing to the lack of a standardized acquisition procedure, and the available metadata tend to be incomplete. In order to keep up with recent trends and avoid the difficulties that come with the lack of large-scale acoustical data, the last part of the research in this thesis diverges from the rest and is devoted to deep learning applied to classification problems in audio, with a focus on speech and environmental sounds. The learning procedure is parametrized, which results in an off-grid learning procedure for audio classification. The learned trends align with perceptual trends, which aids the interpretation of the results.
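    The thesis's new echo-density measure is not spelled out in the abstract; as a point of reference, the classical normalized echo density of Abel and Huang (2006), a standard baseline for this quantity, can be sketched in a few lines of numpy.

        import numpy as np
        from scipy.special import erfc

        def normalized_echo_density(h, fs, win_ms=20.0):
            """Classical normalized echo density profile (Abel & Huang, 2006):
            in each sliding window over the impulse response h, the fraction of
            samples exceeding the local standard deviation, normalized so that
            Gaussian noise gives a value of ~1."""
            w = int(fs * win_ms / 1000)
            norm = erfc(1 / np.sqrt(2))  # exceedance fraction for Gaussian noise
            eta = np.empty(len(h) - w)
            for n in range(len(eta)):
                seg = h[n:n + w]
                eta[n] = np.mean(np.abs(seg) > np.std(seg)) / norm
            # eta << 1: sparse early reflections; eta ~= 1: fully mixed reverberation
            return eta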

    MULAN: A Blind and Off-Grid Method for Multichannel Echo Retrieval

    This paper addresses the general problem of blind echo retrieval: given M sensors measuring, in the discrete-time domain, M mixtures of K delayed and attenuated copies of an unknown source signal, can the echo locations and weights be recovered? This problem has broad applications in fields such as sonar, seismology, ultrasound and room acoustics. It belongs to the broader class of blind channel identification problems, which have been intensively studied in signal processing. Existing methods in the literature proceed in two steps: (i) blind estimation of sparse discrete-time filters and (ii) echo information retrieval by peak-picking on the filters. The precision of these methods is fundamentally limited by the rate at which the signals are sampled: estimated echo locations are necessarily on-grid, and since true locations never match the sampling grid, the precision of the weight estimation suffers as well. This is the so-called basis-mismatch problem in compressed sensing. We propose a radically different approach, building on the framework of finite-rate-of-innovation sampling. The approach operates directly in the parameter space of echo locations and weights, and enables near-exact blind and off-grid echo retrieval from discrete-time measurements. It is shown to outperform conventional methods by several orders of magnitude in precision.
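    MULAN itself is not reproduced here, but the finite-rate-of-innovation building block it rests on can be sketched: K off-grid echo delays are recovered from uniform Fourier-domain samples with an annihilating filter, so the estimates are never tied to a sampling grid. The single-channel, noiseless toy below (function name and setup are illustrative) assumes real weights and requires M >= 2K samples.

        import numpy as np

        def fri_echoes(X, K, T):
            """Recover K off-grid delays/weights from Fourier samples
            X[m] = sum_k a_k exp(-2j*pi*m*t_k/T), m = 0..M-1 (noiseless sketch)."""
            M = len(X)
            # Annihilating filter: the length-(K+1) filter h satisfies
            # sum_l h[l] X[m-l] = 0; stack those equations and take the null space.
            T_mat = np.array([X[K + i - np.arange(K + 1)] for i in range(M - K)])
            _, _, Vh = np.linalg.svd(T_mat)
            ann = Vh[-1].conj()                   # annihilating filter coefficients
            roots = np.roots(ann)                 # roots are u_k = exp(-2j*pi*t_k/T)
            t = np.sort((-np.angle(roots) * T / (2 * np.pi)) % T)
            # Weights from a Vandermonde least-squares fit (assumed real here).
            V = np.exp(-2j * np.pi * np.outer(np.arange(M), t) / T)
            a, *_ = np.linalg.lstsq(V, X, rcond=None)
            return t, a.real

        # Toy check: two echoes at off-grid delays 0.123 and 0.517 (T = 1).
        t_true, a_true = np.array([0.123, 0.517]), np.array([1.0, 0.6])
        m = np.arange(16)
        X = (a_true * np.exp(-2j * np.pi * np.outer(m, t_true))).sum(axis=1)
        print(fri_echoes(X, K=2, T=1.0))  # delays recovered to machine precision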

    Joint Estimation Of The Room Geometry And Modes With Compressed Sensing

    The acoustical behavior of a room for a given microphone and sound source position is usually described by the room impulse response. With standard uniform sampling, estimating the room impulse response for arbitrary positions in the room requires a large number of measurements. To lower the required sampling rate, solutions have emerged that exploit the sparse representation of the room wavefield in terms of plane waves in the low-frequency domain. The plane-wave representation has a simple form in rectangular rooms. In our solution, we observe the basic axial modes of the wave-vector grid to extract the room geometry, and then propagate this knowledge to higher-order modes from the low-pass version of the measurements. Estimating the approximate structure of the k-space should reduce the number of required measurements and speed up the reconstruction without a significant loss of quality.
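    A sketch of the rectangular-room relation the method exploits: the eigenfrequencies of a rigid rectangular room follow directly from its dimensions, so the first-order axial modes alone pin down the geometry. The helper functions below are illustrative, not the paper's estimator.

        import numpy as np

        C = 343.0  # speed of sound in air, m/s

        def mode_freq(n, L, c=C):
            """Eigenfrequency of mode (nx, ny, nz) in a rigid rectangular room
            with dimensions L = (Lx, Ly, Lz):  f = (c/2) * ||n / L||."""
            return 0.5 * c * np.linalg.norm(np.asarray(n) / np.asarray(L))

        def dims_from_axial(f_axial, c=C):
            """Invert the first-order axial modes (1,0,0), (0,1,0), (0,0,1):
            Lx = c / (2 fx), and likewise for Ly, Lz."""
            return c / (2 * np.asarray(f_axial))

        L = (5.0, 4.0, 3.0)
        f = [mode_freq(n, L) for n in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]]
        print(f)                   # [34.3, 42.875, 57.17] Hz
        print(dims_from_axial(f))  # recovers (5.0, 4.0, 3.0)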

    Evaluating audiovisual source separation in the context of video conferencing

    Source separation involving mono-channel audio is a challenging problem, in particular for speech separation, where source contributions overlap in both time and frequency. This task is of high interest for applications such as video conferencing. Recent progress in machine learning has shown that combining visual cues from the video can increase source separation performance. Starting from a recently designed deep neural network, we assess its ability and robustness in separating the visible speakers' speech from other interfering speech or signals. We test it on different configurations of video recordings in which the speaker's face may not be fully visible. We also assess the performance of the network with respect to different sets of visual features extracted from the speakers' faces.
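    The paper's exact evaluation protocol is not detailed in the abstract; as an illustration, a metric commonly used to assess separation quality in this setting, the scale-invariant signal-to-distortion ratio, takes only a few lines of numpy.

        import numpy as np

        def si_sdr(estimate, reference):
            """Scale-invariant signal-to-distortion ratio in dB between an
            estimated source and its reference (1-D arrays of equal length)."""
            estimate = estimate - estimate.mean()
            reference = reference - reference.mean()
            # Project the estimate onto the reference to isolate the target part.
            alpha = np.dot(estimate, reference) / np.dot(reference, reference)
            target = alpha * reference
            noise = estimate - target
            return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

    Higher values mean the estimate is closer to the reference up to a gain; the metric is insensitive to overall scaling, which is why it is popular for comparing separation networks.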